Dimensionality Reduction by Random Mapping: Fast Similarity Computation for Clustering
Authors
Abstract
When the data vectors are high-dimensional, it is computationally infeasible to use data analysis or pattern recognition algorithms that repeatedly compute similarities or distances in the original data space. It is therefore necessary to reduce the dimensionality before, for example, clustering the data. If the dimensionality is very high, as in the WEBSOM method, which organizes textual document collections on a Self-Organizing Map, then even commonly used dimensionality reduction methods such as principal component analysis may be too costly. It will be demonstrated that the document classification accuracy obtained after the dimensionality has been reduced using a random mapping method is almost as good as the original accuracy if the final dimensionality is sufficiently large (about ... out of ...). In fact, it can be shown that the inner product (similarity) between the mapped vectors closely follows the inner product of the original vectors.
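The claim that inner products survive a random mapping can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian entries, the unit-length normalization of each random vector, the toy data, and the names `random_mapping_matrix`, `n_orig`, and `n_reduced` are all assumptions made for illustration, and the dimensions (2000 reduced to 500) are arbitrary.

```python
import numpy as np


def random_mapping_matrix(n_orig, n_reduced, seed=0):
    """Return an (n_orig, n_reduced) random matrix with unit-length rows.

    Each original dimension is assigned a random direction in the reduced
    space. In high dimensions such directions are nearly orthogonal, so
    R @ R.T is close to the identity (assumed construction, for illustration).
    """
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((n_orig, n_reduced))
    R /= np.linalg.norm(R, axis=1, keepdims=True)
    return R


def inner_products(X):
    """Pairwise inner products between the rows of X."""
    return X @ X.T


# Hypothetical toy data: 20 non-negative "document" vectors of unit length.
data_rng = np.random.default_rng(1)
X = data_rng.random((20, 2000))
X /= np.linalg.norm(X, axis=1, keepdims=True)

R = random_mapping_matrix(n_orig=2000, n_reduced=500)
Y = X @ R  # mapped vectors, shape (20, 500)

# Largest deviation between original and mapped inner products.
max_err = np.abs(inner_products(Y) - inner_products(X)).max()
print(f"largest inner-product deviation: {max_err:.3f}")
```

The mapping costs one matrix product and needs no eigendecomposition, which is the practical reason it can be cheaper than principal component analysis when the original dimensionality is very high. The deviation shrinks as `n_reduced` grows, in line with the abstract's condition that the final dimensionality be sufficiently large.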
Similar resources
Recursive nearest agglomeration (ReNA): fast clustering for approximation of structured signals
In this work, we revisit fast dimension reduction approaches, as with random projections and random sampling. Our goal is to summarize the data to decrease computational costs and memory footprint of subsequent analysis. Such dimension reduction can be very efficient when the signals of interest have a strong structure, such as with images. We focus on this setting and investigate feature clust...
Document Clustering: Before and After the Singular Value Decomposition
Document Clustering is an issue of measuring similarity between documents and grouping similar documents together. Information Retrieval (IR) is an issue of comparing query with a collection of documents to locate a set of documents relevant to a particular query. In the vector space IR model, a query is treated as a document which consists of a few terms. Therefore, in both clustering and retr...
Fast Transformation-Invariant Factor Analysis
Dimensionality reduction techniques such as principal component analysis and factor analysis are used to discover a linear mapping between high dimensional data samples and points in a lower dimensional subspace. In [6], Jojic and Frey introduced mixture of transformation-invariant component analyzers (MTCA) that can account for global transformations such as translations and rotations, perform...
A fast and novel technique for color quantization using reduction of color space dimensionality
This paper describes a fast and novel technique for color quantization using reduction of color space dimensionality. The color histogram is repeatedly subdivided into smaller and smaller classes. The colors of each class are projected on a carefully selected line, such that the color dissimilarities are preserved. Instead of using the principal axis of each class, the line is defined by the me...
Journal title:
Volume, issue:
Pages: -
Publication date: 2006